10 research outputs found
Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization
Pretrained language models have achieved remarkable success in natural
language understanding. However, fine-tuning pretrained models on limited
training data tends to overfit and thus diminish performance. This paper
presents Bi-Drop, a fine-tuning strategy that selectively updates model
parameters using gradients from various sub-nets dynamically generated by
dropout. Bi-Drop estimates sub-nets in an in-batch manner, which avoids the
hysteresis in sub-net updating that affects previous methods relying on
asynchronous sub-net estimation. Moreover, Bi-Drop needs only a single
mini-batch to estimate each sub-net, making more efficient use of the
training data. Experiments on the GLUE benchmark demonstrate
that Bi-Drop consistently outperforms previous fine-tuning methods.
Furthermore, empirical results show that Bi-Drop exhibits excellent
generalization ability and robustness in domain-transfer, data-imbalance, and
low-resource scenarios.
Comment: EMNLP 2023 Findings. Camera-ready version. Co-first authors with equal contribution
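The abstract describes the mechanism only at a high level, so the following is a toy sketch of the general idea rather than Bi-Drop's actual algorithm: gradients are computed on the same mini-batch under several dropout masks (each mask defining a sub-net), and only parameters whose gradients agree across sub-nets are updated. The sign-consistency rule and the linear-regression setting are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_with_dropout(W, X, y, keep=0.8):
    """Gradient of the MSE loss for one dropout sub-net of a linear model."""
    mask = rng.random(W.shape) < keep        # each dropout mask defines a sub-net
    Wm = W * mask / keep                     # inverted-dropout scaling
    err = X @ Wm - y                         # residuals of this sub-net
    return (X.T @ err / len(X)) * mask / keep

def bi_drop_style_step(W, X, y, lr=0.1, n_subnets=3):
    """Update only weights whose gradient sign agrees across all sub-nets
    estimated from the SAME mini-batch (the in-batch property above)."""
    grads = np.stack([grad_with_dropout(W, X, y) for _ in range(n_subnets)])
    consistent = np.abs(np.sign(grads).sum(axis=0)) == n_subnets
    return W - lr * grads.mean(axis=0) * consistent

# Toy regression problem: recover W_true from y = X @ W_true.
X = rng.normal(size=(32, 5))
W_true = rng.normal(size=(5, 2))
y = X @ W_true

W = np.zeros((5, 2))
for _ in range(200):
    W = bi_drop_style_step(W, X, y)
loss = ((X @ W - y) ** 2).mean()   # falls well below the initial loss
```

Because every sub-net is drawn from the same mini-batch, no stale gradient estimates are carried over between batches, which is the hysteresis issue the abstract attributes to asynchronous estimation.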
A Survey on In-context Learning
With the increasing ability of large language models (LLMs), in-context
learning (ICL) has become a new paradigm for natural language processing (NLP),
where LLMs make predictions based only on contexts augmented with a few
examples. Exploring ICL to evaluate and extrapolate the abilities of LLMs has
become a new trend. In this paper, we aim to survey and summarize the progress
and challenges of ICL. We first present a formal definition of ICL and clarify
its correlation to related studies. Then, we organize and discuss advanced
techniques, including training strategies, demonstration design strategies,
and related analyses. Finally, we discuss the challenges of ICL and
provide potential directions for further research. We hope that our work can
encourage more research on uncovering how ICL works and improving ICL.
Comment: Papers collected until 2023/05/2
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
Large pretrained language models have shown surprising in-context learning
(ICL) ability. With a few demonstration input-label pairs, they can predict the
label for an unseen input without parameter updates. Despite its great
empirical success, the working mechanism of ICL remains an open question. In
this paper, we explain language models as meta-optimizers and understand
in-context learning as implicit finetuning. Theoretically, we show that
Transformer attention has a dual form of gradient descent. On top of this, we understand ICL
as follows: GPT first produces meta-gradients according to the demonstration
examples, and then these meta-gradients are applied to the original GPT to
build an ICL model. We comprehensively compare the behaviors of in-context
learning and explicit finetuning on real tasks to provide empirical evidence
that supports our understanding. Experimental results show that in-context
learning behaves similarly to explicit finetuning from multiple perspectives.
Inspired by the dual form between Transformer attention and gradient descent,
we design a momentum-based attention by analogy with gradient descent with
momentum. The improved performance over vanilla attention further supports our
understanding from another perspective, and more importantly, shows the
potential to utilize our understanding for future model design. The code is
available at \url{https://aka.ms/icl}.
Comment: Accepted to ACL 2023 Findings
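The dual form mentioned above can be made concrete when the softmax is relaxed to linear attention: attending over demonstration tokens is numerically identical to adding a sum-of-outer-products weight update to a zero-shot linear map, the same functional form as a gradient-descent update to a linear layer. The sketch below uses our own notation and a single simplified head, and checks this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # hidden size
W_V = rng.normal(size=(d, d))        # value projection
W_K = rng.normal(size=(d, d))        # key projection

X_test = rng.normal(size=(d, 4))     # test-input tokens
X_demo = rng.normal(size=(d, 6))     # in-context demonstration tokens
q = rng.normal(size=(d,))            # current query vector

# Linear attention over all tokens [X_test; X_demo]:
X_all = np.concatenate([X_test, X_demo], axis=1)
attn_out = (W_V @ X_all) @ (W_K @ X_all).T @ q

# Dual view: a zero-shot weight matrix from the test tokens alone, plus an
# ICL "meta-gradient" update that is a sum of outer products over the
# demonstrations -- the same form as a gradient update to a linear layer.
W_zsl = (W_V @ X_test) @ (W_K @ X_test).T
dW_icl = (W_V @ X_demo) @ (W_K @ X_demo).T
dual_out = (W_zsl + dW_icl) @ q

assert np.allclose(attn_out, dual_out)
```

The identity holds because `X_all @ X_all.T` decomposes into the test-token and demonstration-token parts; the demonstrations thus act as an implicit weight update, which is the sense in which the paper reads ICL as implicit finetuning.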